fix: post-audit correctness + safety pass#53
Merged
pratyush618 merged 6 commits intomasterfrom Apr 23, 2026
Merged
Conversation
- sqlite move_to_dlq: propagate cascade errors (match postgres/redis) - redis purge_execution_claims: implement via time-indexed sorted set - traits: mention all three backends in doc comment - clippy: convert sort_by(cmp) to sort_by_key(Reverse) - tests: cross-backend regression for purge_execution_claims
- storage trait: add finalize_fan_out_parent (compare-and-swap) - storage trait: add set_workflow_node_running (explicit timestamped transition) - sqlite impl: conditional UPDATEs that return rows-affected - tests: idempotency, failure branch, and no-op-on-terminal coverage
…e cancel - cache WorkflowSqliteStorage on PyQueue via OnceLock (migrations run once) - wrap SQLite round-trips in py.allow_threads - cancel_workflow_run: iterative BFS with visited set (no recursion deadlock) - check_fan_out_completion: delegate to CAS — exactly-once finalization - build_metadata_json: serde_json::json! for proper escaping - add fail_workflow_node / set_workflow_node_running PyQueue methods - replace silent .ok() with log::warn! via cascade_skip_pending_nodes helper
- _submit_sub_workflow: only promote parent to Running after child submit; mark Failed (not Skipped) on compile/submit error so run can finalize - add _state_lock (RLock) guarding _run_configs/_job_to_run/_child_to_parent/_gate_timers - _cleanup_run: cancel pending gate timers, drop child→parent mappings - narrow broad excepts to (RuntimeError, ValueError) where appropriate - regression: test_sub_workflow_compile_failure_marks_parent_failed
- prometheus: add task_filter param (parity with otel/sentry) - async_support/mixins stub: metrics_timeseries 'interval' -> 'bucket' (match real signature) - Cargo.toml: pin dagron-core to commit d1b61aaf - test_contrib: init _task_filter on handcrafted middleware fixtures
- changelog: Unreleased section covering P0/P1 audit fixes - prometheus: document task_filter parameter - workflows/composition: cover compile/submit-time failure behavior
This was referenced Apr 24, 2026
Merged
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Addresses the P0 correctness bugs and safety-critical P1s surfaced by the 2026-04-23 parallel audit (see
.claude/reports/audit-summary-2026-04-23.md). Structural refactors (god-object splits, CI backend matrix, workflow Postgres/Redis ports, test-quality sweep, memory/skills refresh) are deliberately scoped out — each is its own focused effort.Fixed — P0 (correctness)
WorkflowStorage::finalize_fan_out_parentcompare-and-swap;check_fan_out_completiondelegates to it so concurrent children cannot double-expand fan-in.purge_execution_claims: implemented via a time-indexed sorted set; the scheduler's maintenance loop now actually purges on Redis.move_to_dlqcascade parity: SQLite propagates cascade errors, matching Postgres/Redis.Fixed — P1 (safety + performance)
py.allow_threads— worker threads no longer serialize on the GIL during DB work.cancel_workflow_runconverted to iterative BFS with avisitedset — no recursion deadlock, no cycle hazard, no connection-pool exhaustion.WorkflowTracker._state_lock(RLock) guards_run_configs,_job_to_run,_child_to_parent,_gate_timers._cleanup_runcancels pending gate timers and drops child→parent mappings for the completing run.build_metadata_jsonswitched toserde_json::json!— proper escaping for backslashes, control chars, Unicode.WorkflowSqliteStoragecached onPyQueueviaOnceLock— migrations run once per instance, not per API call.let _ = ...error swallowing replaced withlog::warn!via a sharedcascade_skip_pending_nodeshelper.except Exception:narrowed to(RuntimeError, ValueError)where safe.PrometheusMiddlewaregainedtask_filter— parity with OTel/Sentry.AsyncQueueMixin.metrics_timeseriesstub signature corrected (interval→bucket).dagron-corepinned to commit SHA — no more silent upstream breakage.Regression tests added
taskito-workflows::tests:test_set_workflow_node_running,test_finalize_fan_out_parent_is_idempotent,test_finalize_fan_out_parent_failure_branch,test_finalize_fan_out_parent_no_op_on_terminal_node.storage_tests::test_execution_claims_purge: cross-backend contract — runs against SQLite today, Redis/Postgres once CI wires them up.test_workflows_subworkflow::test_sub_workflow_compile_failure_marks_parent_failed: guards the tracker P0.Test plan
cargo check --workspace(default, postgres, redis)cargo test --workspace— 77 pass (was 73)cargo clippy --workspace --all-targets -- -D warnings(default, postgres, redis)uv run maturin developuv run python -m pytest tests/python/— 432 pass, 9 skipped (was 427/9)uv run ruff check py_src/ tests/python/uv run mypy py_src/taskito/ --no-incremental